Modular vector processor architecture targeting at data-level parallelism
Abstract
Exploiting DLP (Data-Level Parallelism) is indispensable in most data streaming and multimedia applications. Several architectures have been proposed to improve both the performance and the energy consumption of such applications. Superscalar and VLIW (Very Long Instruction Word) processors, along with SIMD (Single-Instruction Multiple-Data) and vector processor (VP) accelerators, are among the options available to designers for meeting their requirements. We present an innovative VP architecture that separates the path for data shuffle and memory-indexed accesses from the data path for other vector instructions that access memory. This separation speeds up the most common memory access operations by avoiding extra delays and unnecessary stalls. In our lane-based VP design, each vector lane uses its own private memory to avoid stalls during memory access instructions. The proposed VP, which is developed in VHDL and prototyped on an FPGA, serves as a coprocessor for one or more scalar cores. Benchmarking shows that our VP can achieve very high performance. For example, it achieves a speedup of more than 1500-fold on the color space conversion benchmark compared to running the code on a scalar core. The inclusion of distributed data shuffle engines across vector lanes has a large impact on execution time, primarily for applications such as the FFT (Fast Fourier Transform) that require large amounts of data shuffling. Compared to running the benchmark on a VP without the shuffle engines, the speedup for the 64-point FFT is 5.92 without compiler optimization and 7.33 with it. Compared to runs on the scalar core, the speedups for this benchmark are 52.07 and 110.45 without and with compiler optimization, respectively.
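To make the data shuffling demand of the FFT concrete, the following is a minimal sketch of the bit-reversal reordering that an in-place radix-2 FFT applies to its data. It is an illustration only: the abstract does not say which shuffle patterns the shuffle engines implement, and the function names, the lane count of 8, and the mapping of element i to lane i % lanes are all assumptions made here, not details of the proposed VP.

```python
def bit_reverse(index: int, bits: int) -> int:
    """Return `index` with its lowest `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversal_shuffle(data: list) -> list:
    """Reorder `data` into bit-reversed index order.

    The length must be a power of two, as in a radix-2 FFT.
    """
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    return [data[bit_reverse(i, bits)] for i in range(n)]


def cross_lane_moves(n: int, lanes: int) -> int:
    """Count elements whose source and destination lanes differ.

    Assumes a hypothetical mapping in which element i is stored in
    lane i % lanes; the real VP may distribute data differently.
    """
    bits = n.bit_length() - 1
    return sum(
        1 for i in range(n)
        if bit_reverse(i, bits) % lanes != i % lanes
    )


if __name__ == "__main__":
    print(bit_reversal_shuffle(list(range(8))))
    print(cross_lane_moves(64, 8))
```

Under the assumed mapping, most elements of a 64-point reordering have to move between lanes, which is the kind of traffic that stalls a design where each lane can only reach its own private memory.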
Similar resources
Instruction Level Parallelism In Arm Processor
Now, single-processor performance improvement has dropped Limited amount of exploitable instruction-level parallelism in Load-store ISA: ARM, MIPS. ARM Processors In early 2015, ARM announced a suite of IP for Premium Mobile designs, capability deepens the window for instruction level parallelism. Abstract: Advanced modern processors support Single Instruction Multiple Modular arithmetic, SIMD-...
VLIW-Based Processor for Executing Multi-Scalar/Vector Instructions
This paper proposes a new processor architecture for data-parallel applications based on the combination of the VLIW and vector processing paradigms. It uses a VLIW architecture to process multiple independent scalar instructions concurrently on parallel execution units. Data parallelism is expressed by a vector ISA and processed on the same parallel execution units of the VLIW architecture. The prop...
Compiling for Increasing On-chip Parallelism
It has become a trend for microprocessor companies to add more and more parallelism on a chip to increase performance per chip. At the fine granularity level, vector instruction sets are added, while at the coarse granularity level, multiple cores are put on the same chip. This trend presents a challenge for application developers as well as for compiler developers: how to exploit the power of t...
Vector Processing on Scalar Architectures
A 64-bit processor must necessarily implement substantial parallelism in the movement and processing of data. Data movement is inherently parallel at the bit level, and many operations implemented in the integer unit exhibit bit-level parallelism. A microvector is an array of small data items or bit fields packed into a single word. Scalar operations performed on a microvector can be used to impl...
Hardware/Compiler Co-development for an Embedded Media Processor
Embedded and portable systems running multimedia applications create a new challenge for hardware architects. The microprocessor needed for such systems is a merged general-purpose processor and digital-signal processor, with the programmability of the former and the performance and power budget of the latter. This paper presents the co-development of the instruction set, the hardware, and the com...
Journal title:
- Microprocessors and Microsystems - Embedded Hardware Design
Volume 39
Publication date: 2015